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Abstract Piatt's probabilistic outputs for Support Vector Machines (Piatt, J. in Smola, A., 
et al. (eds.) Advances in large margin classifiers. Cambridge, 2000) has been popular for 
applications that require posterior class probabilities. In this note, we propose an improved 
algorithm that theoretically converges and avoids numerical difficulties. A simple and ready- 
to-use pseudo code is included. 
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1 Introduction 

Given training examples jtj e W 1 , i = 1, . . . , /, labeled by y x ■ e {+1, — 1}, the binary Support 
Vector Machine (SVM) computes a decision function f(x) such that sign(f(x)) can be used 
to predict the label of any test example x . 

Instead of predicting the label, many applications require a posterior class probability 
Pr(_y = 1 |x). Piatt (2000) proposes approximating the posterior by a sigmoid function 

Pr(y = 1|jc) « P A , B (f) = I" * , p , , where / = f(x). (1) 

1 + exp(A/ + B) 

Let each f t be an estimate of f(xj). The best parameter setting z* = (A*, B*) is determined 
by solving the following regularized maximum likelihood problem (with N + of the y^'s 
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positive, and N- negative): 

min F(z) = - J^fa log(pi) + (1 - t t ) log(l - Pi )), 

z—(A,B) 

i = l 

for A - = P A , s a), and t t = N \ +2 ,i = l,...,L (2) 

1*1+2 ify/ = -l 

Piatt (2000) gives a pseudo code for solving (2). In this note, we show how the pseudo 
code could be improved. We analyze (2) in Sect. 2, and propose a more robust algorithm to 
solve it. Better implementation that avoids numerical difficulties is then discussed in Sect. 3. 
We compare our algorithm with Piatt's in Sect. 4. Finally, a ready-to-use pseudo code is in 
Appendix 3. 

2 Choice of optimization algorithm 

We first introduce the simple optimization algorithm used in Piatt's pseudo code (Piatt 
2000). Then, after proving that (2) is a convex optimization problem, we propose a more 
robust algorithm that enjoys similar simplicity, and theoretically converges. 

2.1 Piatt's approach: Levenberg-Marquardt method 

Piatt (2000) uses a Levenberg-Marquardt (LM) algorithm from Press et al. (1992) to 
solve (2). The LM method was originally designed for solving nonlinear least- square prob- 
lems. As an iterative procedure, at the k-th step, this method solves 

(H k + X k I)8 h = -VF(z k ) 

to obtain a direction 8k, and moves the solution from zu to Zk+i = Zk + 8k if the function value 
is sufficiently decreased. Here, Hk is a special approximation of the Hessian of the least- 
square problem, / is the identity matrix, and {zk}kLo * s me sequence of iteration vectors. 
When Xk is large, 8k is close to the negative gradient direction. On the contrary, a small X k 
leads 8k to be more like a Newton's direction. 

In the pseudo code, Piatt (2000) adapts the following rule for updating X k (Press et al. 
1992): 

If F(zk + 8 k ) < F(zk) then X k+ \ <- 0.1 • X k ; Else X k+ i <- 10 • Xk- 

That is, if the new solution decreases the function value, Xk is reduced, and in the next 
iteration a more aggressive direction like Newton's is tried. Otherwise, 8k is unacceptable so 
we increase X k to obtain a shorter vector which, more likely being a descent direction, may 
lower the function value. 

Unfortunately, such an implementation may not converge to the minimum solution of (2). 
To the best of our knowledge, existing convergence proofs (e.g., More 1978) all require more 
complicated or more robust rules for updating Xk. 

In fact, since (2) is not exactly a least-squares problem, the implementation of 
Piatt (2000) aims for general unconstrained optimization. It is known (e.g., Fletcher 1987) 
that for unconstrained optimization we should avoid directly dealing with Xk. Instead, the 
update of X k can be replaced by a trust-region concept, where the size of 8k is controlled. 
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Thus, currently the optimization community uses trust-region methods for unconstrained 
optimization and the LM method is considered as its "progenitor" (Nocedal and Wright 
1999). 

The LM-type implementation of Piatt (2000) has one advantage: simplicity. However, the 
above discussion shows that it may not be the best choice for solving (2). Next, we propose 
an algorithm that is also simple, but enjoys better convergence properties. 

2.2 Our approach: Newton's method with backtracking 

As indicated by Piatt (2000), any method for unconstrained optimization can be used for 
solving (2). Before we choose a suitable method, we analyze the optimization problem in 
more detail. First, the gradient VF(z) and the Hessian matrix H(z) = V 2 F(z) are computed: 



VF(z) = 



H(z) 



~Y!i=ifi(ti-Pi) 

_ Y!i=i(ti-Pi) 

~T!i=ifiP^-Pi) Y!i=ifiPi(\-Pi) 
_J2\=ifiPi( l - Pi) ELi a(i - n) _ 

Some analysis of this Hessian matrix is in the following theorem: 



Theorem 1 The Hessian matrix H(z) is positive semi- definite. In addition, H(z) is positive 
definite if and only z/mini<;</ fi / maxi</</ f t . 

The proof is in Appendix 1. Therefore, problem (2) is convex (and in general strictly 
convex). With such a nice property, we decide to use a simple Newton's method with back- 
tracking line search (Algorithm 6.2, Nocedal and Wright 1999, and Sect. 10.5, Nash and 
Sofer 1996). Though the trust-region type method mentioned in the end of Sect. 2.1 may 
be more robust, the implementation is more complicated. For this two-variable optimization 
problem, simplicity is important, and hence trust-region methods are less favorable. 



Algorithm 1 Newton's method with backtracking line search 

Input: Initial point zo, and parameter a > 0 such that H(z) + al is 
positive definite for all z 
1: for fc = 0,1,2,... do 
2: Solve (H k + aI)S k = -VF{z k ) 

3: Find a k as the first element of the sequence 1, ^, |, . . . to satisfy 
F{z k + a k S k ) < F(z k ) + 0.0001 • a k (VF(z k ) T 5 k ) (3) 

4: Set z k+1 = z k + a k S k 
5: end for 

Our proposed algorithm is in Algorithm 1. As H k = H(zk) may be singular, a small 
positive diagonal matrix is added to the Hessian. With 

VF{z k ) T h = -VF(z k ) T (H k + aiy l VF(z k ) < 0, 

the step size a k can be backtracked until the sufficient decrease condition (3) is satisfied. 
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If H(z) is positive definite for all z, the convergence of Algorithm 1 can be established 
from, for example, Theorem 10.2 by Nash and Sofer (1996). A simplified statement is shown 
in Theorem 2. 

Theorem 2 (Convergence of Algorithm I for general F(z)) 

If F(z) is twice continuously dijferentiable, H(z) is positive definite for all z, and F(z) 
attains an optimal solution at z*, then lim^oo Zk = z*. 

From Theorem 1, in some rare situations, H(z) is positive semi-definite but not positive 
definite. Then, Theorem 2 cannot be directly applied. In Appendix 2, we show that if a > 0, 
Algorithm 1 still converges to an optimal solution. Therefore, we get the following theorem: 

Theorem 3 (Convergence of Algorithm 1 for (2)) 

If Algorithm 1 is applied to (2) such that H(z) + a I is always positive definite, then 
Hindoo Zk exists and is a global optimal solution. 



3 Numerical implementation 

Next, we study the numerical difficulties that arise when solving (2) using Piatt's pseudo 
code. Then, we show our implementation that avoids the difficulties. 

3 . 1 Piatt' s implementation 

Piatt (2000) uses the following pseudo code to calculate the objective value of (2) for a new 
pair of (A, B): 

for i = 1 to len { 

p = 1/ (1+exp (deci [i] *A+B) ) 

//At this step, make sure log(0) returns -2 00 
err -= t*log (p) + (1-t ) *log (1-p) 

} 

Here, len is /, the number of examples used, and err is the objective value. In addition, 
deci [ i ] is f, and hence p stores the calculated p t . However, t was lastly assigned to t\ 
before this loop, and the calculation does not use all ^ , / = 1, . . . , / . Therefore, this pseudo 
code does not correctly calculate the objective value of (2). 

Furthermore, the above code assumes that log(0) returns —200, which reveals possible 
numerical difficulties: 

1. log and exp could easily cause an overflow. If Af + B is large, exp(A/} + B) — ► oo. In 
addition, when p t is near zero, \og(p t ) —> — oo. Although these problems do not always 
happen, considering log(0) to be —200 is not a good solution. 

2. 1 — pi = 1 — 1+exp( \y. + ff) is a "catastrophic cancellation" (Goldberg 1991) when p t is 
close to one. That is, when subtracting two nearby numbers that are already results of 
floating-point operations, the relative error can be so large that most digits are mean- 
ingless. For example, if f = 1, and (A, B) = (—64, 0), in a simple C++ program with 
double precision, 1 — p t returns zero but its equivalent form ^^^f+B) §i yes a more 
accurate result. This catastrophic cancellation actually introduces most of the log(0) oc- 
currences. 

Almost all algorithms that solve (2) need to face these issues. Next, we will discuss some 
techniques to resolve them. 
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3.2 Our implementation 

A problem of catastrophic cancellation can usually be resolved by reformulation: 

-(tilogpi + il-tOlogil-pi)) (4) 
= (ft - l){Aft + B) + log(l + exp(Afi + B)) (5) 
= U (Afi + B) + log(l + exp(-A/i - B)). (6) 

With (5) or (6), 1 — p t does not appear. Moreover, log(0) never happens. 1 

Note, however, that even if (5) or (6) is used, the overflow problem may still occur. The 
problem is not serious if the IEEE floating-point standard is supported (Goldberg 1991): an 
overflow leads to a special number INF, which can still be used in further operations. For 
example, if a large in Line 3 of Algorithm 1 makes the exp operation of (5) to overflow 
for some Afi + B, the new objective value would also be evaluated as INF. Then, under the 
IEEE standard, INF is bigger than the current F(zk), and hence is reduced to a smaller 
value, with which Af + B may not cause an overflow again. 

Furthermore, regardless of whether the IEEE standard is supported, we can replace an 
overflow operation with an underflow one, a rule-of-thumb which has been frequently used 
in numerical computation. In general, an underflow is much less disastrous than an overflow. 
Therefore, we propose implementing (4) with the rule: 

If Afi + B > 0 then use (6); Else use (5). 

In addition, we can evaluate (1) by a similar trick: 

exp(-Af - B) 

If Af + B > 0 then use — — ; Else use (1 ). 

J 1 + exp(- Af -B) K ' 

The trick can be used in calculating VF(z) and H(z) as well: The term 1 — p t in H(z) 
can also cause a catastrophic cancellation. An easy solution is to replace 1 — p t with the 
rule: 

1 exp(A/; + B) 

If Af + B>0 then use ; Else use 1 



1 + txvi-Af -B) 1 + exp(Af + B) 



4 Experiment 

We implemented Piatt's pseudo code (Piatt 2000), fixed the bug that was discussed in the 
beginning of Sect. 3.1, and compared it to our proposed algorithm. For fairness, both algo- 
rithms were realized in python, and were set with a stopping condition || VF(zk) Woo < 10~ 5 . 
For the value of a in Algorithm 1, we considered two approaches: 

- fixed: use a small fixed a = 10~ 12 . 

- dynamic: apply Theorem 1 to check whether H(z) is positive definite, and set a = 0 
instead if the condition is true. 

We compared the algorithms on two UCI data sets, sonar and shuttle (Newman et al. 
1998). Only classes 2 and 4 were taken from shuttle to form a binary problem. The values f 



As pointed out by a reviewer, in many popular languages, log ( 1+ . . . ) can be replaced by loglp ( . . . ) 
to compute the result more accurately when the operand exp (Afi + B) or exp (—Af — B) is close to zero. 



^ Springer 



272 



Mach Learn (2007) 68: 267-276 



Table 1 Average results of different algorithms for solving (2) on sonar 

Algorithm # overflow # iterations Final # backtracking 

errors F(z) steps per iteration 

Piatt's 0 5.77 107.78 

ours, fixed 0 5.56 107.78 0 

ours, dynamic 0 5.56 107.78 0 



Table 2 Average results of different algorithms for solving (2) on shuttle 

Algorithm # overflow # iterations Final # backtracking 

errors F(z) steps per iteration 

Piatt's 589.30 8.00 158.62 

ours, fixed 0 6.66 157.83 0.17 

ours, dynamic 0 6.68 157.83 0.24 



were generated with the scaled data sets by LIBSVM using the RBF kernel (Chang and 
Lin 2001). The soft-margin parameter log 2 C was varied in— 5, —3, ...,15, and the kernel 
parameter log 2 y was varied in —15, —13,. ..,3. That is, 110 different problems (2) were 
tested for each data set. 

Tables 1 and 2 list the average results for each data set. We first compared each algorithm 
based on the number of overflow errors encountered, the number of iterations, and the final 
objective value F(z). While Piatt's algorithm did reasonably well on sonar, it encountered 
numerous overflow errors on shuttle, needed more iterations, and sometimes could not return 
a solution with decent F(z) . On the other hand, our proposed algorithm worked well on both 
data sets. 

The number of backtracking steps per iteration was also listed for the two approaches of 
setting a. We can see that the fixed approach needed less backtracking steps per iteration on 
shuttle. The benefit came from the regularization on some nearly singular H(z). In addition, 
the fixed approach is simpler to implement in practice, and hence shall be preferred. 

Finally, a simple and robust code is in Appendix 3. It has been integrated into LIBSVM 
since version 2.6 (Chang and Lin 2001). Source code in several popular languages can be 
downloaded at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools. 

Acknowledgements We thank John Piatt, S. Sathiya Keerthi, and the anonymous reviewers for helpful 
comments. 



Appendix 1 Proof of Theorem 1 



Since the definition of p t in (1) implies that 0 < pt < 1, we can define vectors u and v with 
Ui = fi*J pi(\ — pi), and Vi = Va(1 — Pi), respectively. Then 



H(z) = 
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By Cauchy inequality, 



temz))=^u^^v^-^Uiv?j >0. (7) 

Since the two diagonal terms and the determinant are all nonnegative, the matrix H(z) is 
positive semi-definite. 

From (7), det(//(z)) = 0 if and only if u and v are parallel vectors. Since u t = f t Vi and 
Vi > 0, this situation happens if and only if all /}'s are equal. That is, the matrix H(z) is 
positive definite if and only if mini</</ f t / maxi</</ f t . 



Appendix 2 Proof of Theorem 3 

Case 1: H(z) is always positive definite. If one can prove that 

S = {(A, B): F(A, B) < F(A 0 , B 0 )} (8) 

is bounded, then F(A, B) attains an optimal solution within S and Theorem 2 can be applied 
to show the convergence. 

From Theorem 1, assume without loss of generality that f\ / f 2 . Let 



Since 



/i i 
fi \ 

is invertible, it suffices to show that S = {a: (A, B) e S} is bounded. If not, there exists an 
infinite sequence m S such that 

lim max(|(^)i|, \ (a k ) 2 \) = oo. 

Then, without loss of generality, there exists an infinite subsequence AC such that 
lim^oo :kGjc |fe)i I = oo. However, since F(A k , B k ) is the summation of positive terms, 

I e (a k )\ 
F(A k , B k ) > - h log TT ^ 7 - (1 - h ) log TT ^ 7 . 

The right-hand- side above goes to oo as \{a k )\ \ -> oo. Therefore, there exists some k such 
that F(A k , B k ) > F(A 0 , B 0 ), which somehow contradicts a k e S. Thus, S is bounded and 
the proof is complete. 

Case 2: When H(z) is only positive semi-definite for some z, from Theorem 1, all /}'s 
are equal. By considering f t = f for all /, we can define a = Af + B and a single-variable 
function F(a) = F(A, B). Then 

I -» le a 

F\a) =yn , F"(a) = -. 

W \ + e a W (1 + e a ) 2 
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By simplifying (3), in Algorithm 1, (H(z) + crl)8 = -VF(z) is 

V d+^)2 [/ ij + \tr i+*7 



(9) 



If a > 0, the solution 5 satisfies = / • (<5) 2 . Then, the first (and the second) equation of 
the linear system (9) is the same as 



F"(a) + 



(10) 



Interestingly, if we apply Algorithm 1 to minimize F(a) with jf^ added to its Hessian 

F"(a), (10) is exactly the linear system to be solved. Therefore, if <2 0 = A 0 f + B 0 , then for 
all k, 



ajc+i =a k + a k (f • (8 k )i + (<$*) 2 ) 

= (A k + a k (S k h)f + (ft + a*(« 2 ). 



(11) 



Since F(a) is strictly convex from F"(a) > 0, similar techniques in Case 1 can be used 
to prove that F(a) attains an optimal solution. Therefore, from Theorem 2, the sequence 
{a k }^ ={) globally converges. Then, from (8 k )\ = / • {8 k ) 2 and (11), 

00 

lim a k = (A 0 f + B 0 ) + (f 2 + 1) YV(« t ) 2 



k=0 



exists. Therefore, lim^oo B k = B 0 + YlT=o a k(^k)i exists, and so does lim^oo A k . In addi- 
tion, they form an optimal solution of minimizing F(A, B). 



Appendix 3 Pseudo code of Algorithm 1 

We recommend using double precision for the algorithm. 

Input parameters : 

deci = array of SVM decision values 

label = array of booleans : is the example labeled +1? 
priori = number of positive examples 
priorO = number of negative examples 
Outputs : 

A, B = parameters of sigmoid 

//Parameter setting 

maxiter=100 //Maximum number of iterations 

minstep=le-10 //Minimum step taken in line search 
sigma=le-12 //Set to any value > 0 

//Construct initial values: target support in array t, 
// initial function value in fval 

hiTarget= (priorl+1 . 0) / (priorl+2 . 0) , loTarget=l/ (priorO+2 . 0) 
len=priorl+prior0 // Total number of data 
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for i = 1 to len { 
if (label [i] > 0) 

t [i] =hiTarget 
else 

t [i] =loTarget 

} 

A=0.0, B=log( (prior0+1.0) / (priorl+1.0) ) , fval=0 . 0 
for i = 1 to len { 

fApB=deci [i] *A+B 

if ( f ApB >= 0) 

fval += t [i] *fApB+log(l+exp(-fApB) ) 

else 

fval += (t [i] -1) *fApB+log(l+exp(fApB) ) 

} 

for it = 1 to maxiter { 

//Update Gradient and Hessian (use H' = H + sigma I) 
hll=h22=sigma, h21=gl=g2=0 . 0 
for i = 1 to len { 
fApB=deci [i] *A+B 
if ( f ApB >= 0) 

p=exp(-fApB) / (1.0+exp(-fApB) ) , q=1.0/ ( 1 . 0+exp ( -f ApB) ) 
else 

p=1.0/ (1. 0+exp (f ApB) ) , q=exp(fApB) / ( 1 . 0+exp ( f ApB) ) 
d2=p*q 

hll += deci [i] *deci [i] *d2, h22 += 6.2, h21 += deci[i]*d2 
dl=t [i] -p 

gl += deci[i]*dl, g2 += dl 

} 

if (abs (gl ) <le-5 && abs (g2 ) <le-5 ) //Stopping criteria 
break 

//Compute modified Newton directions 
det=hll*h2 2-h21*h21 

dA=- (h22*gl-h21*g2) /det, dB=- ( -h2 l*gl+hll*g2 ) /det 

gd=gl*dA+g2*dB 

stepsize=l 

while (stepsize >= minstep) { //Line search 

newA=A+stepsize*dA, newB=B+stepsize*dB, newf =0 . 0 
for i = 1 to len { 

fApB=deci [i] *newA+newB 
if ( f ApB >= 0) 

newf += t [i] *fApB+log(l+exp(-fApB) ) 
else 

newf += (t [i] -1) *fApB+log(l+exp(fApB) ) 

} 

if (newf<fval+0 . 0001*stepsize*gd) { 
A=newA, B=newB , f val=newf 
break //Sufficient decrease satisfied 

} 
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else 

stepsize /= 2.0 

} 

if (stepsize < minstep) { 
print 'Line search fails' 
break 

} 

} 

if (it >= maxiter) 

print 'Reaching maximum iterations' 

return [A,B] 
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